The following two answers are provided in the two pictures
Question 1 & Part A of Question 2
Question 2, Part B
Using the majority vote and the average probability approaches, what is the final classification in each case?
red <- c(.1,.15,.2,.2,.55,.6,.6,.65,.7,.75)
mean(red)
## [1] 0.45
sum(red>.5)
## [1] 6
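Under the majority vote, six of the ten estimates exceed 0.5, so the ensemble classifies Red; under the average probability approach, the mean estimate is 0.45 &lt; 0.5, so it classifies Green. A base-R sketch (the class labels "Red"/"Green" are assumed for illustration):

```r
red <- c(.1, .15, .2, .2, .55, .6, .6, .65, .7, .75)
# Majority vote: Red if more than half of the individual estimates exceed 0.5
ifelse(sum(red > .5) > length(red) / 2, "Red", "Green")   # "Red" (6 of 10 votes)
# Average probability: Red only if the mean estimate exceeds 0.5
ifelse(mean(red) > .5, "Red", "Green")                    # "Green" (mean = 0.45)
```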
Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees as a function of \(\alpha\).
Use K-fold cross-validation to choose \(\alpha\). That is, divide the training observations into \(K\) folds. For each \(k = 1, \ldots, K\):
Repeat steps 1 and 2 on all but the \(k^{th}\) fold of the training data.
Evaluate the mean squared prediction error on the data in the left-out \(k^{th}\) fold, as a function of \(\alpha\).
Average the results for each value of \(\alpha\), and pick \(\alpha\) to minimize the average error.
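Step 2 above corresponds to calling `prune.tree()` without a `best` argument, which returns the full cost-complexity pruning path: one subtree per value of \(\alpha\) (stored as `$k`). A minimal sketch, assuming the ISLR2 and tree packages are installed:

```r
library(ISLR2); library(tree)
# Grow a large tree, then extract the pruning sequence
fit  <- tree(Sales ~ ., data = Carseats)
path <- prune.tree(fit)
# Each alpha (k) corresponds to a best subtree of a given size
cbind(alpha = path$k, size = path$size)
```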
set.seed(10)
library(ISLR2);library(tree)
# Splitting data into training and test set
train <- sample(1:nrow(Carseats), nrow(Carseats)/2)
test <- Carseats[-train,]
# Fitting a regression tree on the training set using sales as response variable
set.seed(10)
tree_carseat <- tree(Sales~.,data = Carseats, subset = train)
plot(tree_carseat)
text(tree_carseat, pretty = 1)
#output of carseat regression tree
summary(tree_carseat)
##
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats, subset = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "CompPrice" "Population"
## Number of terminal nodes: 14
## Residual mean deviance: 2.378 = 442.2 / 186
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.33500 -1.02300 0.06757 0.00000 0.96470 3.93500
# Obtain Test MSE
set.seed(10)
seat_pred <- predict(tree_carseat,newdata = test)
mean((seat_pred-test$Sales)^2)
## [1] 5.202316
Part B Answers:
The regression tree used 5 variables in its construction and has 14 terminal nodes. The best predictor according to the tree is shelf location, which forms the top split. Using my validation set I obtained a test MSE of 5.202, which is higher than the training residual mean deviance of 2.378, as expected for an unpruned tree.
# Using cross validation to determine optimal level of tree complexity.
set.seed(10)
cv_seat <- cv.tree(tree_carseat)
plot(cv_seat$size, cv_seat$dev, type = "b")
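Rather than reading the best size off the plot, one can index into the `cv.tree` output with `which.min`; illustrated here with hypothetical size/deviance values, since the actual minimizer depends on the seed:

```r
# which.min on the deviance vector picks the subtree size with the lowest
# cross-validation error (values below are made up for illustration)
cv_example <- list(size = c(14, 10, 7, 5, 3, 1),
                   dev  = c(480, 455, 440, 430, 470, 640))
best_size <- cv_example$size[which.min(cv_example$dev)]
best_size   # 5 in this made-up example
```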
# Pruning Tree with 5 terminal nodes
prune_seat <- prune.tree(tree_carseat, best = 5)
plot(prune_seat)
text(prune_seat, pretty = 0)
#Calculating Test MSE with pruned tree
prune_yhat <- predict(prune_seat, newdata = test)
mean((prune_yhat-test$Sales)^2)
## [1] 5.00269
Part C Answer: The optimal level of complexity had only 5 terminal nodes, based on the smallest cross-validation error. Yes, pruning improved the test MSE, as \(5.003 < 5.202\).
#Performing bagging approach
library(randomForest)
bag_seat <- randomForest(Sales~.,data = Carseats, subset = train, mtry = 10, importance = TRUE)
bag_yhat <- predict(bag_seat, newdata = test)
mean((bag_yhat - test$Sales)^2)
## [1] 3.002559
bag_seat
##
## Call:
## randomForest(formula = Sales ~ ., data = Carseats, mtry = 10, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.745786
## % Var explained: 68.41
#Using importance() function to determine most important variables
importance(bag_seat)
## %IncMSE IncNodePurity
## CompPrice 16.7056094 137.663159
## Income 6.1726270 82.647872
## Advertising 6.7667529 61.038198
## Population 1.3436274 61.170907
## Price 52.3596661 418.926579
## ShelveLoc 75.4088192 719.467211
## Age 22.6609712 164.147571
## Education 0.5010586 37.596310
## Urban -2.4362480 6.362525
## US 2.5465344 7.943683
varImpPlot(bag_seat)
Part D Answer:
The test MSE I obtained from bagging was \(3.003\), which is lower than both the unpruned regression tree (5.202) and the pruned tree (5.003). This makes sense because bagging reduces variance (though not bias): 500 trees are grown on 500 bootstrapped training sets and their predictions are averaged. The resulting lower-variance model yields a lower test MSE.
Using the importance() function, I found that shelf location (ShelveLoc) and price (Price) are the most important variables.
#Using random forest to analyze data; I will choose m to equal 4
bag_random <- randomForest(Sales~., data = Carseats, subset = train, mtry = 4, importance = TRUE)
bag_random
##
## Call:
## randomForest(formula = Sales ~ ., data = Carseats, mtry = 4, importance = TRUE, subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 2.831533
## % Var explained: 67.42
bag_ran_yhat <- predict(bag_random, newdata = test)
mean((bag_ran_yhat-test$Sales)^2)
## [1] 3.001544
# Using importance() function
importance(bag_random)
## %IncMSE IncNodePurity
## CompPrice 14.0562500 147.04907
## Income 2.9759045 102.36762
## Advertising 6.1148252 82.51048
## Population -0.2430633 88.88349
## Price 37.5451519 367.81849
## ShelveLoc 52.9994335 581.38917
## Age 18.9617144 201.52419
## Education 1.1375056 59.41032
## Urban -0.6638014 10.99337
## US 3.6517779 14.98082
varImpPlot(bag_random)
Part E Answers:
Using m = 4, meaning only four of the 10 predictors are considered at each split, resulted in a test MSE of 3.002, essentially the same as (in fact marginally lower than) the bagging approach's 3.003. Once again the most important variables are ShelveLoc and Price.
#Understanding the effect of M
bag_random_6 <- randomForest(Sales~.,data = Carseats, subset = train, mtry = 6, importance = TRUE)
bag_random_yhat <- predict(bag_random_6, newdata = test)
mean((bag_random_yhat - test$Sales)^2)
## [1] 2.902661
Part E Answers (continued):
With four variables considered at each split I obtained a test MSE of 3.002; with six variables the test MSE was 2.903; and the full bagging approach gave a test MSE of 3.003.
The random forest with m = 6 achieved the lowest test MSE, though the differences are small. This could occur because there isn't strong correlation between the predictors in this data set, so decorrelating the trees yields only a modest improvement over bagging.
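To see the effect of m more systematically, one could sweep `mtry` from 1 to 10 and plot the resulting test MSE. A sketch, assuming the ISLR2 and randomForest packages are installed (the exact MSE values will vary with the seed):

```r
library(ISLR2); library(randomForest)
set.seed(10)
train <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
test  <- Carseats[-train, ]
# Fit a random forest for each choice of mtry and record the test MSE
mse <- sapply(1:10, function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats, subset = train, mtry = m)
  mean((predict(fit, newdata = test) - test$Sales)^2)
})
plot(1:10, mse, type = "b", xlab = "mtry", ylab = "Test MSE")
```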